How to Create a Boxplot

A boxplot (also called a box-and-whisker plot) is a graphical representation of the distribution of a dataset. It summarizes the data with the five-number summary (minimum, \(Q_1\), median, \(Q_3\), and maximum), and some versions also mark outliers. In this lesson, we define a boxplot and outline the steps to construct one.

Boxplot

What is a Boxplot?

A boxplot visually displays the spread and skewness of a dataset using a box and two line segments (the whiskers). It is built using the five-number summary. 

  • A vertical line segment is drawn at each of \(Q_1\), \(Q_2\) (the median), and \(Q_3\). The ends of these segments are connected to form a box.
  • A line is drawn from the center of the side defined by \(Q_1\) to the minimum value.
  • A line is drawn from the center of the side defined by \(Q_3\) to the maximum value.

Some boxplots also display outliers, but the version we are constructing does not do that.

How to Draw a Boxplot

Steps to Construct a Boxplot

Follow these steps to create a boxplot:

  1. Find the five-number summary.
  2. Draw the box: Plot a box from \( Q_1 \) to \( Q_3 \), with a vertical line at the median.
  3. Draw the whiskers: Extend lines from the box to the minimum and maximum values.
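Step 1 can be done with a calculator or in code. The following Python sketch shows one way to compute the five-number summary. It assumes the convention in which the quartiles are the medians of the lower and upper halves of the sorted data, with the overall median left out of both halves when the number of values is odd; other tools may use a different quartile convention and give slightly different values. The function name `five_number_summary` and the sample data are illustrative only.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers.

    Assumption: quartiles are the medians of the lower and upper halves,
    excluding the overall median when the count is odd.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]       # values below the median position
    upper = data[n - half:]   # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Illustrative data (not from the lesson)
print(five_number_summary([7, 1, 5, 3, 6, 2, 4]))
# (1, 2, 4, 6, 7)
```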

Example 1

This dataset represents the length of gaming sessions (in minutes) for 25 gamers who played an online multiplayer game over the weekend. Using the dataset below, construct a boxplot of gaming session lengths.

Gaming Session Lengths (Minutes)
Session Lengths (Minutes)
15 30 45 60 75
120 150 90 200 180
95 110 130 140 85
70 160 170 55 40
190 210 35 100 250

Solution

First, let's copy the data into the Summary Statistics Calculator and select the boxes for the Minimum Value, \(Q_1\), Median, \(Q_3\), and Maximum Value:

The five-number summary for gaming times.

Therefore, we get \[\min = 15\quad Q_1=57.5\quad \text{Median}=100\quad Q_3=165\quad\max=250\] Next, we draw a number line and label the five-number summary to scale on the axis. Then we draw a vertical line segment above each of \(Q_1\), the median, and \(Q_3\).

A number line labeled with the five-number summary. A vertical line segment is placed above each of the middle three numbers.

Then, connect the ends of the three segments to draw the box.

The box has been drawn in.

Draw the left whisker from the center of the box's left edge to the minimum value.

The left whisker has been drawn in.

Draw the right whisker from the center of the box's right edge to the maximum value.

The right whisker has been added. 
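As a cross-check on the values read from the calculator, the sketch below recomputes the five-number summary in Python and prints a rough one-line text version of the boxplot. It uses `statistics.quantiles` with its default `method="exclusive"`, which reproduces the quartiles above; whether the Summary Statistics Calculator uses the same quartile convention is an assumption. The helper `text_boxplot` and its 60-character width are made up for illustration.

```python
from statistics import quantiles

sessions = [15, 30, 45, 60, 75, 120, 150, 90, 200, 180, 95, 110, 130, 140, 85,
            70, 160, 170, 55, 40, 190, 210, 35, 100, 250]

q1, med, q3 = quantiles(sessions, n=4)  # default method="exclusive"
summary = (min(sessions), q1, med, q3, max(sessions))
print(summary)  # (15, 57.5, 100.0, 165.0, 250)

def text_boxplot(low, q1, med, q3, high, width=60):
    """Draw a one-line boxplot: '-' whiskers, '=' box, '|' at Q1, median, Q3."""
    def col(x):
        # Map a data value to a column index on a scaled axis.
        return round((x - low) / (high - low) * (width - 1))

    marks = (col(q1), col(med), col(q3))
    row = []
    for i in range(width):
        if i in marks:
            row.append("|")        # vertical segments of the box
        elif col(q1) < i < col(q3):
            row.append("=")        # interior of the box
        else:
            row.append("-")        # whiskers out to the minimum and maximum
    return "".join(row)

print(text_boxplot(*summary))
```

The longer run of `-` characters on the right mirrors the longer right whisker in the finished plot.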

$$\tag*{\(\blacksquare\)}$$

Example 2

During exam week, researchers recorded the number of cups of coffee consumed daily by 30 college students. Use the Boxplot Generator to construct a boxplot for the coffee consumption.

Coffee Consumption (Cups per Day)
Number of Cups
0 1 2 3 5 3 4 6 7 2
3 4 5 1 0 8 3 6 4 5
2 7 3 4 6 2 1 5 3 4

Solution

Copy the data, open the Boxplot Generator, paste the data into the spreadsheet, and close the spreadsheet. The boxplot should generate automatically in the tool.

A boxplot of the number of coffees consumed per day.

$$\tag*{\(\blacksquare\)}$$

Skewness in Boxplots

How to Identify Skewness in a Boxplot

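A boxplot provides insight into whether a dataset is roughly symmetric (as a normal distribution is), skewed left, or skewed right by examining the position of the median and the length of the whiskers. A boxplot alone cannot confirm that data are normally distributed; it can only show whether the shape is consistent with one.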

  • A boxplot suggests a symmetric distribution, such as a normal distribution, if the median is centered within the box (between \( Q_1 \) and \( Q_3 \)) and the whiskers on both sides are approximately equal in length.

    A boxplot of a normal distribution for free throw percentages.
  • A boxplot is left skewed if the median is closer to \( Q_3 \) than to \( Q_1 \), or if the left whisker is longer than the right whisker.

    Skew-Left Dot Plot on the Number of Hours of Sleep a New Parent Gets
  • A boxplot is right skewed if the median is closer to \( Q_1 \) than to \( Q_3 \), or if the right whisker is longer than the left whisker.

    A strong skew-right boxplot for student loan payments.
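The rules above can be turned into a rough numerical check. The Python sketch below is one possible heuristic, not a standard test: the function name `describe_skew`, the 10% `tolerance`, and the "mixed" label for conflicting signals are all assumptions made for illustration. It compares the two halves of the box and the two whiskers separately, then reports a direction only when the signals that are present agree.

```python
def _direction(diff, scale, tolerance):
    """Return 1 (right), -1 (left), or 0 (balanced) for one comparison."""
    if scale == 0 or abs(diff) <= tolerance * scale:
        return 0
    return 1 if diff > 0 else -1

def describe_skew(low, q1, med, q3, high, tolerance=0.1):
    """Classify a five-number summary as 'left', 'right', 'symmetric', or 'mixed'.

    Assumption: differences smaller than `tolerance` times the relevant
    length (box width or total whisker length) are treated as balanced.
    """
    # Positive when the median sits closer to Q1 than to Q3.
    box = _direction((q3 - med) - (med - q1), q3 - q1, tolerance)
    # Positive when the right whisker is longer than the left whisker.
    whiskers = _direction((high - q3) - (q1 - low),
                          (high - q3) + (q1 - low), tolerance)
    signals = {box, whiskers} - {0}
    if not signals:
        return "symmetric"
    if signals == {1}:
        return "right"
    if signals == {-1}:
        return "left"
    return "mixed"  # box and whiskers point in opposite directions

# Five-number summary of the gaming sessions from Example 1
print(describe_skew(15, 57.5, 100, 165, 250))  # right
```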

Comparing Datasets Using Boxplots

The Advantages of Boxplots for Comparison

Boxplots are an excellent tool for comparing multiple datasets because they provide a visual summary of key statistical measures while maintaining simplicity. Here’s why boxplots are useful for comparing data:

Side-by-Side Comparison of Distributions

Boxplots allow multiple datasets to be displayed together, making it easy to compare their centers, spreads, and skewness. This is particularly useful when comparing groups, such as test scores across different schools or income levels across regions.

Quick Insight into Variability

The interquartile range (IQR), represented by the width of the box, gives an immediate understanding of how spread out the middle 50% of the data is. If one dataset has a wider box than another, the middle half of its values is more variable.
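In symbols, the IQR is the difference between the third and first quartiles. For instance, using the quartiles found for the gaming session data in Example 1:

\[\text{IQR} = Q_3 - Q_1 = 165 - 57.5 = 107.5\]

So the middle 50% of the gaming sessions span 107.5 minutes.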

Easy Detection of Skewness

By examining the median’s position inside the box and the relative whisker lengths, boxplots make it easy to see if a dataset is symmetrical, right-skewed, or left-skewed. This helps in understanding differences between distributions, such as salary distributions in different industries.

Compact Yet Informative

Boxplots do not require a large amount of space and can be used effectively in reports or presentations to compare datasets quickly. They condense information into a single, easy-to-read visual while still capturing essential details about the data.

Example 3

These datasets represent the 100-meter sprint times (in seconds) for two groups: Olympic sprinters and high school sprinters. Use the Boxplot Generator for each dataset and compare their distributions.

100m Sprint Times: High School vs. Olympic Athletes
High School (s)
10.55 10.60 10.65 10.72 10.78
10.82 10.85 10.89 10.94 10.98
11.02 11.07 11.10 11.14 11.18
11.21 11.25 11.29 11.35 11.40

Olympic (s)
9.58 9.69 9.72 9.76 9.81
9.85 9.88 9.91 9.93 9.95
9.98 10.01 10.03 10.05 10.08
10.12 10.15 10.19 10.22 10.25

Solution

Copy the data, open the Boxplot Generator, paste the data into the spreadsheet, and close the spreadsheet.  The plot for the high school sprint times should appear automatically. 

The boxplot of the high school sprint times.

To generate the second boxplot, change the number of datasets to Two and set the second dataset to column B. The second plot will then appear.

The boxplots for each set of sprint times are shown side-by-side.

As expected, the high school times are slower than the Olympic times. Also notice that

  • the Olympic times are slightly skewed left, while the high school times are closer to symmetric, which means a few Olympians significantly outperform the other athletes, and
  • the IQR for the high school students is wider than for the Olympians, meaning that the high school students have more variability in their times.
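These observations can also be checked numerically. The Python sketch below computes the median, IQR, and whisker lengths for both groups. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data, which may differ slightly from the convention the Boxplot Generator uses; the helper `five_number_summary` is illustrative.

```python
from statistics import median

high_school = [10.55, 10.60, 10.65, 10.72, 10.78, 10.82, 10.85, 10.89, 10.94, 10.98,
               11.02, 11.07, 11.10, 11.14, 11.18, 11.21, 11.25, 11.29, 11.35, 11.40]
olympic = [9.58, 9.69, 9.72, 9.76, 9.81, 9.85, 9.88, 9.91, 9.93, 9.95,
           9.98, 10.01, 10.03, 10.05, 10.08, 10.12, 10.15, 10.19, 10.22, 10.25]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); quartiles are medians of the halves."""
    data = sorted(values)
    half = len(data) // 2
    return (data[0], median(data[:half]), median(data),
            median(data[len(data) - half:]), data[-1])

for name, times in (("High school", high_school), ("Olympic", olympic)):
    low, q1, med, q3, high = five_number_summary(times)
    print(f"{name}: median={med:.3f}, IQR={q3 - q1:.3f}, "
          f"left whisker={q1 - low:.3f}, right whisker={high - q3:.3f}")
# High school: median=11.000, IQR=0.395, left whisker=0.250, right whisker=0.205
# Olympic: median=9.965, IQR=0.270, left whisker=0.250, right whisker=0.150
```

Under this convention the high school IQR (0.395 s) is wider than the Olympic IQR (0.270 s), and the Olympic left whisker is noticeably longer than its right whisker, in line with the two observations above.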

$$\tag*{\(\blacksquare\)}$$

Conclusion

Boxplots provide a visual summary of a dataset, highlighting its center, spread, and skewness; some versions also mark potential outliers. They are especially useful for comparing populations that perform at different levels, such as high school and Olympic athletes, when a side-by-side comparison is still of interest.